Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering

Authors

  • Sarah Jane Delany
  • Derek G. Bridge
Abstract

In this paper, we compare case-based spam filters, focusing on their resilience to concept drift. In particular, we evaluate how to track concept drift using a case-based spam filter that uses a feature-free distance measure based on text compression. In our experiments, we compare two ways to normalise such a distance measure, finding that the one proposed in [1] performs better. We show that a policy as simple as retaining misclassified examples has a hugely beneficial effect on handling concept drift in spam but, on its own, it results in the case base growing by over 30%. We then compare two different retention policies and two different forgetting policies (one a form of instance selection, the other a form of instance weighting) and find that they perform roughly as well as each other while keeping the case base size constant. Finally, we compare a feature-based textual case-based spam filter with our feature-free approach. In the face of concept drift, the feature-based approach requires the case base to be rebuilt periodically so that we can select a new feature set that better predicts the target concept. We find feature-free approaches to have lower error rates than their feature-based equivalents.
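To make the idea of a compression-based, feature-free distance concrete, here is a minimal Python sketch. It is not the authors' implementation. The abstract does not name the two normalisations it compares, so the two shown here (the Normalised Compression Distance and the Compression-based Dissimilarity Measure, both known from the wider literature) are assumptions, as are the use of `zlib` as the compressor, the k-nearest-neighbour majority vote, and all function names. The last function illustrates the simple policy of retaining misclassified examples.

```python
import zlib
from typing import Callable

Case = tuple[bytes, str]  # (email text, label), label is "spam" or "ham"


def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib is an assumed stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance (one assumed normalisation)."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based Dissimilarity Measure (another assumed normalisation)."""
    return csize(x + y) / (csize(x) + csize(y))


def classify(case_base: list[Case], text: bytes, k: int = 3,
             dist: Callable[[bytes, bytes], float] = cdm) -> str:
    """Majority vote among the k cases nearest to `text`."""
    neighbours = sorted(case_base, key=lambda case: dist(text, case[0]))[:k]
    spam_votes = sum(1 for _, label in neighbours if label == "spam")
    return "spam" if spam_votes > len(neighbours) / 2 else "ham"


def classify_and_retain(case_base: list[Case], text: bytes, true_label: str,
                        k: int = 3,
                        dist: Callable[[bytes, bytes], float] = cdm) -> str:
    """Classify `text`, then add it to the case base only if it was misclassified."""
    predicted = classify(case_base, text, k, dist)
    if predicted != true_label:
        case_base.append((text, true_label))
    return predicted
```

Because no features are extracted, updating such a filter only means adding or removing raw messages. Used alone, this retention rule lets the case base grow without bound, which is why the abstract pairs retention with forgetting policies.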

Similar resources

Feature-Based and Feature-Free Textual CBR: A Comparison in Spam Filtering

Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advant...

Using Case-Based Reasoning for Spam Filtering

Spam is a universal problem with which everyone is familiar. Figures published in 2005 state that about 75% of all email sent today is spam. In spite of significant new legal and technical approaches to combat it, spam remains a big problem that is costing companies meaningful amounts of money in lost productivity, clogged email systems, bandwidth and technical support. A number of approaches a...

A Case-Based Approach to Spam Filtering that Can Track Concept Drift

There are a few key benefits of a case-based approach to spam filtering. First, the many different sub-types of spam suggest that a local learner, such as Case-Based Reasoning (CBR), will perform well. Second, the lazy approach to learning in CBR allows for easy updating as new types of spam arrive. Third, the case-based approach to spam filtering allows for the sharing of cases and thus a shari...

Tracking Concept Drift at Feature Selection Stage in SpamHunting: An Anti-spam Instance-Based Reasoning System

In this paper we propose a novel feature selection method able to handle concept drift problems in spam filtering domain. The proposed technique is applied to a previous successful instance-based reasoning e-mail filtering system called SpamHunting. Our achieved information criterion is based on several ideas extracted from the well-known information measure introduced by Shannon. We show how r...

A case-based technique for tracking concept drift in spam filtering

Spam filtering is a particularly challenging machine learning task as the data distribution and the concept being learned change over time. It exhibits a particularly awkward form of concept drift as the change is driven by spammers wishing to circumvent spam filters. In this paper we show that lazy learning techniques are appropriate for such dynamically changing contexts. We present a case-based...

Publication date: 2007